ABSTRACT

The massive semantic data sources linked in the Web of Data 
give new meaning to old features like navigation; introduce 
new challenges like semantic specification of Web fragments; 
and make it possible to specify actions relying on semantic 
data. In this paper we introduce a declarative language to 
face these challenges. Based on navigational features, it is 
designed to specify fragments of the Web of Data and actions 
to be performed based on these data. We implement it in 
a centralized fashion, and show its power and performance. 
Finally, we explore the same ideas in a distributed setting, 
showing their feasibility, potentialities and challenges. 
1. INTRODUCTION

Classically the Web has been modelled as a huge graph 
of links between pages [4]. This model included Web fea-
tures such as links that carry no labels and are generated only
by the owner of the page.^ Although Web pages are created
and kept distributively, their small size and lack of struc- 
ture stimulated the idea to view searching and querying 
through single and centralized repositories (built from pages 
via crawlers). With the advent of the Web of Data, that is, 
semantic data at massive scale IslflGl, these assumptions, in 
general, do not hold anymore. First, links are semantically 
labelled (thanks to RDF triples) thus can be used to orient 
and control the navigation, are generated distributively and 
can be part of any data source. Hence, it has become a 
reality -using the words of Tim Berners-Lee- that anyone 
can say anything about anything and publish it anywhere. 
Second, data sources have a truly distributed nature due to 
their huge size, autonomous generation, and standard RDF 
structure. This makes it inconvenient and impractical to re-
organize them in central repositories, as is done for Web pages.

^Even though the XLink specification [t] allows links to be defined
in a third page, it was never used massively.




Figure 1: Classical Web versus Web of Data. Size, 
distributive character, and semantic description of 
data give navigation a prominent role.

In this setting, navigation along the nodes of the Web of 
Data, using the semantics stored in each data source, be- 
comes significant. To model these issues, rather than as a 
graph, the Web of Data is better represented as a set of nodes 
plus data describing their semantic structure "hanging" from 
each node (see Fig. 1). This model better ex-
presses the distributed creation and maintenance of data, and
the fact that its structure is provided by dynamical and dis- 
tributed data sources. In particular, it reflects the fact that 
at each moment of time, and for each particular agent, the 
whole network of data on the Web is unknown [19].

This new scenario calls for new models and languages to 
query and explore this semantic data space. In particular we 
highlight three functionalities: (1) a new type of navigation 
emerges as an important feature, in order to traverse sites 
and data sources; (2) closely tied to it, navigation charts or 
specifications, that is, semantic descriptions of fragments of 
the Web; (3) specification of actions one would like to per- 
form over this data (e.g., retrieving data, sending messages, 
etc.) also becomes relevant. Navigation, specifications of re- 
gions, and actions appear as part of the basic functionalities 
for exploring and doing data management over the Web of 
Data. Ideally, one would like to have a simple declarative 
language that integrates all of them. 

In this paper we present such a language, which we call 
NautiLOD, and show that it can be readily implemented 
on the current Linked Open Data (LOD) network [16]. In 
fact, we introduce the swget tool that exploits current Web 
protocols and works on LOD data. Finally, we explore its dis-
tributed version and implement an application as a proof-of-
concept to show its feasibility, potentialities and challenges.
Figure 2: An excerpt of data that can be navigated from dbp:StanleyKubrick.



NautiLOD by example. To help the reader to get a more 
concrete idea of the language, we present some examples 
using an excerpt of real-world data shown in Fig. 2. (The
formal syntax and semantics are introduced in Section 3.)

Example 1.1. (Aliases via owl:sameAs) Specify what
is predicated of Stanley Kubrick in DBPedia and also con-
sider his possible aliases in other data sources.

The idea is to follow <owl:sameAs>-paths, which start from
Kubrick's URI in DBPedia. Recursively, for each URI u
reached in this way, check in its data source the triples
(u, owl:sameAs, v). Select all v's found. Finally, for each
such v, return all URIs w in triples of the form (v, p, w)
found in v's data source. The specification in NautiLOD is
as follows:

(<owl:sameAs>)*/<_>

where <_> denotes a wildcard for RDF predicates. In Fig. 2,
when evaluating this expression starting from the URI
dbp:StanleyKubrick we get all the different representations
of Stanley Kubrick provided by dbpedia.org, freebase.org
and linkedmdb.org. From these nodes, the expression <_>
matches any predicate. The final result is: {dbp:DavidLynch,
dbp:NewYork, dbp:FilmEditing, Imdb:Producer, Imdb:/film/334, fb:Path
of Glory, http://en.wikipedia.org/wiki/Stanley_Kubrick}. Note that
the naive search for Kubrick's information in DBPedia would
only give {http://en.wikipedia.org/wiki/Stanley_Kubrick, New York,
David Lynch, Film Editing}.
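
The evaluation strategy of Example 1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the swget implementation: the dictionary DESCRIPTIONS is a hypothetical in-memory stand-in for dereferencing, and its triples (including the ex: prefix and the alias URIs) are invented toy data, not the content of Fig. 2.

```python
OWL_SAME_AS = "owl:sameAs"

# Hypothetical toy data: URI -> set of (subject, predicate, object) triples
# that a client would obtain by dereferencing that URI.
DESCRIPTIONS = {
    "dbp:StanleyKubrick": {
        ("dbp:StanleyKubrick", OWL_SAME_AS, "fb:StanleyKubrick"),
        ("dbp:StanleyKubrick", "dbpo:influenced", "dbp:DavidLynch"),
    },
    "fb:StanleyKubrick": {
        ("fb:StanleyKubrick", OWL_SAME_AS, "ex:KubrickAlias"),
        ("fb:StanleyKubrick", "ex:directed", "fb:PathsOfGlory"),
    },
    "ex:KubrickAlias": {
        ("ex:KubrickAlias", "ex:role", "ex:Producer"),
    },
}


def dereference(uri):
    """Return the triples describing `uri` (empty if not dereferenceable)."""
    return DESCRIPTIONS.get(uri, set())


def same_as_closure(seed):
    """URIs reachable from `seed` via zero or more owl:sameAs links."""
    seen = {seed}
    frontier = [seed]
    while frontier:
        u = frontier.pop()
        for s, p, o in dereference(u):
            if s == u and p == OWL_SAME_AS and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def aliases_then_any(seed):
    """Evaluate (<owl:sameAs>)*/<_>: follow aliases, then any predicate."""
    result = set()
    for v in same_as_closure(seed):
        for s, _p, o in dereference(v):
            if s == v:
                result.add(o)
    return result
```

Note that, because <_> matches any predicate, the alias URIs themselves also appear in the result whenever they are objects of an owl:sameAs triple.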

A more complex example, which extends standard naviga- 
tional languages with actions and SPARQL queries is: 

Example 1.2. URIs of movies (and their aliases), whose 
director is more than 50 years old, and has been influenced, 
either directly or indirectly, by Stanley Kubrick. Send by 
email the Wiki pages of such directors as you get them. 

This specification involves influence-paths and aliases as
in the previous example; tests over the dataset associated to
a given URI (if somebody influenced by Kubrick is found,
check if he or she has the right age), a test expressed in NautiLOD
using ASK-SPARQL queries; and actions to be taken using
data from the data source. The NautiLOD specification is:

(<dbpo:influenced>)+[Test]/Act/<dbpo:director>/(<owl:sameAs>)?

where the test and the action are as follows: 

Test= ASK ?p <dbpo:birthDate> ?y. FILTER(?y<1961-01-01) . 

Act= sendEmail(?p) [SELECT ?p WHERE {?x <foaf:page> ?p.}]. 

In the expression, the symbol + denotes that one or more 
levels of influence are acceptable, e.g., we get directors like 
David Lynch and Quentin Tarantino. From this set of re- 
sources, the constraint on the age enforced by the ASK query 
is evaluated on the data source associated to each of the re- 
sources already matched. This filter leaves in this case only 
dbp:DavidLynch. At this point, over the elements of this
set (one element in this case), the action will send via email
the page (obtained from the SELECT query). The action
sendEmail, implemented by an ad-hoc programming proce-
dure, does not influence the navigation process. Thus, the
evaluation will continue from the URI u = dbp:DavidLynch,
by navigating the property dbpo:director (found in the
dataset D obtained by dereferencing u). For example, in
D we find the triple (u, dbpo:director, dbp:BlueVelvet).
Then, from dbp:BlueVelvet we launch the final part of
the expression, already seen in Example 1.1. It can be
checked that the final result of the evaluation is: (1) the set
{dbp:BlueVelvet, fb:BlueVelvet}, that is, data about the
movie Blue Velvet from dbpedia.org and freebase.org; (2)
the set of actions done; in this case one email sent.

Contributions of the paper. The following are the main
contributions of this paper:

(1) First: we define a general declarative specification lan-
guage, called NautiLOD, whose navigational features ex-
ploit regular expressions on RDF predicates, enhanced with
existential tests (based on ASK-SPARQL queries) and ac- 
tions. It allows both: to specify a set of sites that match the 
semantic description, and to orient the navigation using the 



information that these sites provide. Its basic navigational 
features are inspired both by wget and XPath, enhanced with 
semantic specifications, using SPARQL to filter paths, and 
with actions to be performed while navigating. We present a 
simple syntax, a formal semantics and a basic cost analysis. 

(2) Second: we implement a version of the language by
developing the application swget that evaluates NautiLOD 
expressions in a centralized form (at the distinguished ini-
tial node). Being based on NautiLOD, swget can
perform semantically-driven navigation of the Web of Data
as well as retrieval actions. This tool relies on the computa- 
tional resources of the initial node issuing the command and 
exploits the Web protocol HTTP. It is readily available on 
the current Linked Open Data (LOD) network. Its limita- 
tion is, of course, the scalability: the traffic of data involved 
could be high, making the navigation costly. 

(3) Third: we implement swget in a distributed environ- 
ment. Based on simple assumptions on third parties (a small 
application that each server should run to join it, and that 
in many ways extends the idea of current endpoints), we 
show the feasibility of such an application that simulates a 
travelling agent, and hint at the powerful uses it can have. 
From this proof-of-concept, we explore the potentialities of 
this idea and its challenges. 

The paper is organized as follows. Section 2 provides a
quick overview of the Web of Data. In Section 3 the Nau-
tiLOD language is introduced: syntax, semantics and its
evaluation costs. In Section 4 swget, a centralized imple-
mentation of NautiLOD, is introduced: its architecture,
pseudo-code and experimental evaluation. Section 5 deals
with the distributed version of swget, showing the feasibil-
ity and potentialities of this application. Section 6 discusses
related work. Finally, in Section 7 we draw conclusions and
delineate future work.

2. PRELIMINARIES: THE WEB OF DATA 

This section provides some background on RDF and Linked 
Open Data (LOD) that are at the basis of the Web of Data. 
For further details the reader can refer to fsl pjl . 



RDF. The Resource Description Framework (RDF) is a 
metadata model introduced by the W3C for representing 
information about resources in the Semantic Web. RDF is 
built upon the notion of statement. A statement defines the 
property p holding between two resources, the subject s and 
the object o. It is denoted by (s, p, o), and is thus called a triple in
RDF. A collection of RDF triples is referred to as an RDF
graph. RDF exploits Uniform Resource Identifiers (URIs)
to identify resources. URIs represent global identifiers in
the Web and enable access to the descriptions of resources
according to specific protocols (e.g., HTTP).

2.1 Web of Data - the LOD initiative 

The LOD initiative leverages RDF to publish and inter-
link resources on the Web. This enables a new (semantic)
space called Web of Data. Objects in this space are linked 
and looked-up by exploiting (Semantic) Web languages and 
technologies. LOD is based on some principles, which can 
be seen more as best practices than formal constraints: 

(1) Real world objects or abstract concepts must be as-
signed names in the form of URIs.

(2) In particular, HTTP URIs have to be used so that 
people can look them up by using existing technologies. 



(3) When someone looks up a URI, associated information 
has to be provided in a standard form (e.g., RDF). 

(4) Interconnections among URIs have to be provided by 
including references to other URIs. 

An important notion in this context is that of dereference-
able URI. A dereferenceable URI represents an identifier of
a real world entity that can be used to retrieve a represen- 
tation, by an HTTP GET, of the resource it identifies. The 
client can negotiate the format (e.g., RDF, N3) in which it 
prefers to receive the description. 

2.2 Data in the LOD 

Data in the LOD are provided by sites (i.e., servers),
which cover a variety of domains. For instance, dbpedia.org
and freebase.org provide cross-domain information, geon-
ames.org publishes geographic information, pubmed.org pro-
vides information in the domain of life science, whereas acm.org cov-
ers information about scientific publications.

Theoretically in each server resides an RDF triple-store 
(or a repository of RDF data). In order to obtain informa- 
tion about the resource identified by a URI u, a client has to 
perform an HTTP GET u request. This request is handled 
by the Linked Data server, which answers with a set of triples.
This is usually said to be the dereferencing of u. 
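
As a rough illustration of dereferencing with content negotiation, the following Python sketch builds the HTTP GET request with the standard library. The function names, the default media type and the timeout are assumptions made for this example; swget itself is implemented in Java and its actual request handling is not shown in this paper.

```python
import urllib.request


def build_dereference_request(uri, accept="application/rdf+xml"):
    """Build an HTTP GET request that asks the server for an RDF serialization.

    The Accept header carries the client's preferred format (content
    negotiation); the default value here is an illustrative choice.
    """
    return urllib.request.Request(uri, headers={"Accept": accept}, method="GET")


def dereference(uri, timeout=10.0):
    """Perform the GET and return the raw response body (bytes).

    Parsing the body into triples is left out of this sketch.
    """
    request = build_dereference_request(uri)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Only build_dereference_request is side-effect free; dereference performs a real network call and may raise urllib.error.URLError if the server cannot be reached.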

In the Web of Data, resources are not isolated from one 
another, in the spirit of the fourth principle of LOD, but are
linked. The interlinking of these resources, and thus of the
corresponding sites in which they reside, forms the so-called
Linked Open Data Cloud.

3. A NAVIGATION LANGUAGE FOR THE 
WEB OF DATA 

As we argued in the Introduction, there are data man- 
agement challenges emerging in the Web of Data that need 
to be addressed. Particularly important are: (i) the speci- 
fication of parts of this Web, thus of semantic fragments of 
it; (ii) the possibility to declaratively specify the navigation
and exploit the semantics of data placed at each node of
the Web; (iii) performing actions while navigating. To cope
with these needs, this section presents a navigation language
for the Web of Data, inspired by two non-related languages: 
wget, a language to automatically navigate and retrieve Web 
pages; and XPath, a language to specify parts of documents 
in the world of semi-structured data. We call it Navigational 
language for Linked Open Data, NautiLOD. 

NautiLOD is built upon navigational expressions, based 
on regular expressions, filtered by tests using ASK-SPARQL 
queries (over the data residing in the nodes that are being 
navigated), and incorporating actions to be triggered while 
the navigation proceeds. NautiLOD makes it possible to: (i) semanti-
cally specify collections of URIs; (ii) perform recursive navi-
gation of the Web of Data, controlled using the semantics of
the RDF data hanging from the URIs that are visited (that
can be obtained by dereferencing these URIs); (iii) perform
actions on specific URIs, for instance, selectively retrieving
data from them.

Before presenting the language, we present in Section 3.1
an abstract data model of the Web of Data. Then we present
the syntax of NautiLOD (Section 3.2), and the formal se-
mantics (Section 3.3). Finally, we provide a basic cost model
for the complexity of evaluating NautiLOD expressions (Section 3.4).
3.1 Data model

We define a minimal abstract model of the Web of Data 
to highlight the main features required in our discussion. 

Let $\mathcal{U}$ be the set of all URIs and $\mathcal{L}$ the set of all liter-
als. We distinguish between two types of triples: RDF links
$(s,p,o) \in \mathcal{U} \times \mathcal{U} \times \mathcal{U}$, which encode connections among resources
in the Web of Data, and literal triples $(s,p,o) \in \mathcal{U} \times \mathcal{U} \times \mathcal{L}$,
which are used to state properties or features of the resource
identified by the subject s. Note that the object of a triple,
in the general case, can also be a blank node. However, here
we will not consider blank nodes, to simplify the presentation of the
main ideas (note also that the usage of blank nodes is dis-
couraged [16]). Let $\mathcal{T}$ be the set of all triples in the Web of
Data. The following three notions will be fundamental.

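Definition 3.1 (Web of Data $\mathcal{T}$). Let $\mathcal{U}$ and $\mathcal{L}$ be
infinite sets. The Web of Data (over $\mathcal{U}$ and $\mathcal{L}$) is the set of
triples $(s,p,o)$ in $\mathcal{U} \times \mathcal{U} \times (\mathcal{U} \cup \mathcal{L})$. We will denote it by $\mathcal{T}$.

Definition 3.2 (Description Function $\mathcal{D}$). A func-
tion $\mathcal{D} : \mathcal{U} \rightarrow \mathcal{P}(\mathcal{T})$ associates to each URI $u \in \mathcal{U}$ a subset
of triples of $\mathcal{T}$, denoted by $\mathcal{D}(u)$, which is the set of triples
obtained by dereferencing u.

Definition 3.3 (Web of Data Instance $\mathcal{W}$). A Web
of Data instance is a pair $\mathcal{W} = (\mathcal{U}, \mathcal{D})$, where $\mathcal{U}$ is the set of
all URIs and $\mathcal{D}$ is a description function.

Note that not all the URIs in $\mathcal{U}$ are dereferenceable. If a
URI $u \in \mathcal{U}$ is not dereferenceable then $\mathcal{D}(u) = \emptyset$.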
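
The data model above can be mirrored by a small Python class. This is a sketch under stated assumptions: the class name, the method names and the sample triples are hypothetical, and a finite dictionary stands in for the (infinite) set of URIs and for actual dereferencing.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class WebOfDataInstance:
    """Toy counterpart of a Web of Data instance: a description function
    defined explicitly on finitely many URIs and empty everywhere else."""

    def __init__(self, descriptions: Dict[str, Iterable[Triple]]):
        self._descriptions = {u: frozenset(ts) for u, ts in descriptions.items()}

    def D(self, u: str) -> FrozenSet[Triple]:
        """Description function: triples obtained by dereferencing u.

        A URI with no stored description yields the empty set, matching
        the convention used for non-dereferenceable URIs.
        """
        return self._descriptions.get(u, frozenset())

    def has_description(self, u: str) -> bool:
        """True if dereferencing u yields at least one triple."""
        return len(self.D(u)) > 0


# Hypothetical usage with invented triples.
W = WebOfDataInstance({
    "dbp:Rome": [("dbp:Rome", "owl:sameAs", "fb:Rome")],
})
```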

3.2 Syntax 

NautiLOD provides a mechanism to declaratively: (i) 
define navigational expressions; (ii) allow semantic control 
over the navigation via test queries; (iii) retrieve data by per- 
forming actions as side-effects along the navigational path. 

The navigational core of the language is based on regu- 
lar path expressions, pretty much like Web query languages 
and XPath. The semantic control is done via existential 
tests using ASK-SPARQL queries. This mechanism makes it
possible to redirect the navigation based on the information present
at each node of the navigation path. Finally, the language
makes it possible to trigger actions during the navigation according
to decisions based on the original specification and the local
information found.



path   ::= pred | (pred)^-1 | action | path/path
         | (path)? | (path)* | (path|path) | path[test]
pred   ::= <RDF predicate> | <_>
test   ::= ASK-SPARQL query
action ::= procedure[Select-SPARQL query]


Table 1: Syntax of the NautiLOD language. 



The syntax of the language NautiLOD is defined accord-
ing to the grammar reported in Table 1. The language is
based on path expressions, that is, concatenations of base-
case expressions built over predicates, tests and actions. The
language accepts concatenations of basic and complex types 
of expressions. Basic expressions are predicates and actions; 
complex expressions are disjunctions of expressions; expres- 
sions involving a number of repetitions using the features of 
regular languages; and expressions followed by a test. The 
building blocks of a NautiLOD expression are: 



1. Predicates. The base case, pred can be an RDF predi- 
cate or the wildcard <_> used to denote any predicate. 

2. Test Expressions. A test denotes a query expression. 
Its base case is an ASK-SPARQL query. 

3. Action Expressions. An action is a procedural specifi- 
cation of a command (e.g., send a notification message, 
PUT and GET commands on the Web, etc.), which 
obtains its parameters from the data source reached 
during the navigation. It is a side-effect, that is, it 
does not influence the subsequent navigation process. 

If restricted to (1) and (2), NautiLOD can be seen as a 
declarative language to describe portions of the Web of Data, 
i.e., sets of URIs that conform to some semantic specification.

3.3 Semantics 

NautiLOD expressions are evaluated against a Web of 
Data instance W and a URI u indicating the starting point 
of the evaluation. The meaning of a NautiLOD expression 
is a set of URIs defined by the expression plus a set of actions 
produced by the evaluation of the expression. The resulting 
set of URIs are the leaves in the paths according to the 
NautiLOD expression, originating from the seed URI u. 

For instance, the expression type, evaluated over u, will
return the set of URIs $u_k$ reachable from u by "navigat-
ing" the predicate type, that is, by inspecting triples of the
form $(u, type, u_k)$ included in $\mathcal{D}(u)$. Similarly, the expres-
sion type[q] will filter, from the results of the evaluation
of type, those URIs $u_k$ for which the query q evaluated on
their descriptions $\mathcal{D}(u_k)$ is true. Finally, the evaluation of an
expression type[q]/a will return the results of type[q] and
perform the action a (possibly using some data from $\mathcal{D}(u_k)$).

The formal semantics of NautiLOD is reported in Ta-
ble 2. The fragment of the language without actions follows
the lines of the formalization of XPath by Wadler. Actions
are treated essentially as side-effects and evaluated while
navigating. Given an expression, a Web of Data instance
$\mathcal{W} = (\mathcal{U}, \mathcal{D})$, and a seed URI u, the semantics has the fol-
lowing modules:

• $E[\![path]\!](u, \mathcal{W})$: Evaluates the set of URIs selected by
the navigational expression path starting from the URI
u in the Web of Data instance $\mathcal{W}$. Additionally, it
collects the actions associated to each of such URIs.

• $U[\![path]\!](u, \mathcal{W})$: Defines the set of URIs specified by
the expression path when forgetting the actions.

• $A[\![path]\!](u, \mathcal{W})$: Executes the actions specified by the
evaluation of the navigational expression path.

• $Sem[\![path]\!](u, \mathcal{W})$: Outputs the meaning of the expres-
sion path, namely, the ordered pair of two sets: the set
of URIs specified by the evaluation of path; and the
set of actions performed according to this information.

Note on some decisions made: Any sensible real implemen- 
tation can benefit from giving an order to the elements of 
the output action set. As for the formal semantics, at this
stage we assume that actions are independent from one an-
other and that the world $\mathcal{W}$ is static during the evaluation
(to avoid overloading our discussion with the relevant issue
of synchronization, which is at this point orthogonal to the
current proposal). Thus, we decided to denote the actions



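$$
\begin{array}{lcl}
E[\![\texttt{<p>}]\!](u, \mathcal{W}) &=& \{(u', \bot) \mid (u, \texttt{<p>}, u') \in \mathcal{D}(u)\} \\
E[\![(\texttt{<p>})^{-1}]\!](u, \mathcal{W}) &=& \{(u', \bot) \mid (u', \texttt{<p>}, u) \in \mathcal{D}(u)\} \\
E[\![\texttt{<\_>}]\!](u, \mathcal{W}) &=& \{(u', \bot) \mid \exists \texttt{<p>},\ (u, \texttt{<p>}, u') \in \mathcal{D}(u)\} \\
E[\![act]\!](u, \mathcal{W}) &=& \{(u, act)\} \\
E[\![path_1/path_2]\!](u, \mathcal{W}) &=& \{(u'', a) \in E[\![path_2]\!](u', \mathcal{W}) : \exists b,\ (u', b) \in E[\![path_1]\!](u, \mathcal{W})\} \\
E[\![(path)?]\!](u, \mathcal{W}) &=& \{(u, \bot)\} \cup E[\![path]\!](u, \mathcal{W}) \\
E[\![(path)^*]\!](u, \mathcal{W}) &=& \{(u, \bot)\} \cup \bigcup_{i=1}^{\infty} E[\![path_i]\!](u, \mathcal{W}) \mid path_1 = path \wedge path_i = path_{i-1}/path \\
E[\![path_1|path_2]\!](u, \mathcal{W}) &=& E[\![path_1]\!](u, \mathcal{W}) \cup E[\![path_2]\!](u, \mathcal{W}) \\
E[\![path[test]]\!](u, \mathcal{W}) &=& \{(u', a) \in E[\![path]\!](u, \mathcal{W}) : test(u') = true\} \\[4pt]
U[\![path]\!](u, \mathcal{W}) &=& \{v : \exists a,\ (v, a) \in E[\![path]\!](u, \mathcal{W})\} \\
A[\![path]\!](u, \mathcal{W}) &=& \{Exec(a, v) : (v, a) \in E[\![path]\!](u, \mathcal{W})\} \\[4pt]
Sem[\![path]\!](u, \mathcal{W}) &=& (U[\![path]\!](u, \mathcal{W}),\ A[\![path]\!](u, \mathcal{W}))
\end{array}
$$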



Table 2: Semantics of NautiLOD. The semantics of an expression is composed of two sets: (1) the set of URIs
of $\mathcal{W}$ satisfying the specification; (2) the actions produced by the evaluation of the specification. Exec(a, u)
denotes the execution of action a over u. $\bot$ indicates the empty action (i.e., no action).



produced by the evaluation of an expression as a set. It is 
not difficult to see that one could have chosen a list as the 
semantics for output actions. 
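
The rules of Table 2 can be read as a recursive evaluator. The Python sketch below follows them rule by rule over a toy description function; it is an illustration under stated assumptions, not the swget code. Expressions are encoded as tuples (an encoding chosen for this example), tests are plain Python predicates over a description rather than ASK-SPARQL queries, and actions are recorded as (action, URI) pairs instead of being executed. The star rule is computed as a fixpoint, which terminates on finite data even in the presence of cycles.

```python
BOTTOM = None  # stands for the empty action


def E(path, u, D):
    """Return the set of (URI, action) pairs selected by `path` from seed u."""
    kind = path[0]
    if kind == "pred":   # <p>
        return {(o, BOTTOM) for (s, p, o) in D(u) if s == u and p == path[1]}
    if kind == "inv":    # (<p>)^-1
        return {(s, BOTTOM) for (s, p, o) in D(u) if o == u and p == path[1]}
    if kind == "any":    # <_>
        return {(o, BOTTOM) for (s, p, o) in D(u) if s == u}
    if kind == "act":    # action
        return {(u, path[1])}
    if kind == "seq":    # path1/path2
        return {pair for (v, _) in E(path[1], u, D) for pair in E(path[2], v, D)}
    if kind == "opt":    # (path)?
        return {(u, BOTTOM)} | E(path[1], u, D)
    if kind == "alt":    # (path1|path2)
        return E(path[1], u, D) | E(path[2], u, D)
    if kind == "test":   # path[test]
        return {(v, a) for (v, a) in E(path[1], u, D) if path[2](D(v))}
    if kind == "star":   # (path)*
        result = {(u, BOTTOM)}
        visited, frontier = {u}, {u}
        while frontier:
            step = set()
            for v in frontier:
                step |= E(path[1], v, D)
            result |= step
            frontier = {v for (v, _) in step} - visited
            visited |= frontier
        return result
    raise ValueError("unknown expression kind: %r" % (kind,))


def Sem(path, u, D):
    """Return (set of URIs, set of recorded (action, URI) pairs)."""
    pairs = E(path, u, D)
    uris = {v for (v, _) in pairs}
    actions = {(a, v) for (v, a) in pairs if a is not BOTTOM}
    return uris, actions


# Hypothetical toy data and usage: the expression type[q]/a from the text.
TOY = {
    "u": {("u", "type", "k1"), ("u", "type", "k2")},
    "k1": {("k1", "age", "60")},
    "a": {("a", "next", "b")},
    "b": {("b", "next", "c")},
    "c": {("c", "next", "a")},
}
D = lambda uri: TOY.get(uri, set())
has_age = lambda desc: any(p == "age" for (_s, p, _o) in desc)
EXPR = ("seq", ("test", ("pred", "type"), has_age), ("act", "notify"))
```

With this encoding, Sem(EXPR, "u", D) keeps only k1 (the only node whose description passes the test) and records one notify action on it.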

3.4 Evaluation of Costs and Complexity 

We present a general analysis of costs and complexity of 
the evaluation of NautiLOD expressions over a Web of Data 
instance $\mathcal{W}$. We can separate the costs in three parts, where
$E$ are expressions, $\bar{E}$ action-and-test-free expressions, $A$ ac-
tions and $T$ tests:

$cost(E, \mathcal{W}) = cost(\bar{E}, \mathcal{W}) + cost(A) + cost(T)$. (1)

Since actions do not affect the navigation process we can 
treat their cost separately. Besides, in our language, tests 
are ASK-SPARQL queries having a different structure from 
the pure navigational path expressions of the language. Even 
in this case we can treat their cost independently. 

Actions. NautiLOD is designed for acting on the Web 
of Data. In this scenario, the cost of actions has essentially 
two components: execution and transmission. The execu- 
tion cost boils down to the cost of evaluating the SELECT 
SPARQL query that gives the action's parameters. As for 
transmission costs, a typical example is the wget command, 
where the cost is the one given by the GET data command. 

Action-and-test-free. This fragment of NautiLOD can 
be considered essentially as the PF fragment of XPath (lo-
cation paths without conditions), which is well known to be
(with respect to combined complexity) NL-complete under
L-reductions (Thm. 4.3, [10]). The idea of the proof is sim-
ple: membership in NL follows from the fact that we can
guess the path while verifying it in logarithmic space. The hardness
essentially follows from a reduction from the directed graph
reachability problem. Thus we have:

Theorem 3.4. With respect to combined complexity, the
action-and-test-free fragment of NautiLOD is NL-complete
under L-reductions.

Combined refers to the fact that the input parameters are 
the expression size and the data size. Note that what really 
matters is not the whole Web (the data) , but only the set of 
nodes reachable by the expression. Thus it is more precise to 
speak of expression size plus set-of-visited nodes size. The 
worst case is of course the whole size of the Web. 



Tests. The evaluation of tests (i.e., ASK-SPARQL queries) 
has a cost. This cost is well known and one could choose
particular fragments of SPARQL to control it [21]. How-
ever, tests will possibly reduce the size of the set of nodes
visited during the evaluation. Thus $cost(\bar{E}, \mathcal{W})$ has to
be reduced to take into account the effective subset of nodes
reachable thanks to the filtering performed by the tests. Let
$\mathcal{W}_T$ be $\mathcal{W}$ when taking into account this filtering. We have:

$cost(E, \mathcal{W}) = cost(\bar{E}, \mathcal{W}_T) + cost(A) + cost(T)$. (2)

Section 4.2 will discuss some examples on real world data by
underlining the contribution of each component of the cost. 

Final Considerations. In a distributed setting, with 
partially unknown information and a network of almost un-
bounded size, the notion "cost of evaluating an expression e"
appears less significant than in a controlled centralized envi-
ronment. In this scenario, a more pertinent question seems
to be: "given an amount of resources r and the expression
e, how much can I get with r satisfying e?". This calls for
optimizing (according to some parameters) the navigation
starting from a given URI according to equation (2).

4. IMPLEMENTATION OF NautiLOD 

This section deals with swget, a tool implementing Nau- 
tiLOD. The tool swget implements all the navigational fea- 
tures of NautiLOD, a set of actions centred on retrieving 
data, and adds (for practical reasons) a set of ad-hoc op- 
tions for further controlling the navigation from a network- 
oriented perspective (e.g., size of data transferred, latency 
time) that are not yet available today as RDF statements.

swget has been implemented in Java and is available as:
(i) a developer release, which includes a command-line tool
that is easily embeddable in custom applications; (ii) an
end user release, which features a GUI. Further details, ex-
amples, and the complete syntax, along with the downloadable
versions, are available at the swget Web site.^

4.1 Architecture 

The high level architecture of swget is reported in the left
part of Fig. 3. The Command interpreter receives the input,
i.e., a seed URI, a NautiLOD expression and a set of options. 

^http://swget.wordpress.com



The input is then passed to the Controller module, which
checks if a network request is admissible and possibly passes
it to the Network Manager. A request is admissible if it com-
plies with what is specified by the NautiLOD expression and
with the network-related navigation parameters (see Sec-
tion 4.1.1). The Network Manager performs HTTP GET
requests to obtain streams of RDF data. These streams are
processed to obtain Jena RDF models, which are
passed to the Link Extractor. The Link Extractor takes as
input an automaton constructed by the NautiLOD inter-
preter and selects a subset of outgoing links in the current
model according to the current state of the automaton. The
set is given to the Controller module, which starts the
cycle over. The execution ends either when some navigational
parameter is satisfied or when there are no more URIs to be
dereferenced.
Figure 3: swget architecture and scenario.

4.1.1 Network-based controlled navigation 

NautiLOD is designed to semantically control the nav- 
igation. However, it can be the case that a user wants to 
control the navigation also in terms of network traffic gen- 
erated. A typical example is a user running swget from 
a mobile device with limited Internet capabilities. This is 
why swget includes features to add more control to the nav-
igation through the parameters reported in Table 3. Each
option is given as input to swget as a pair (param, value).

Table 3: Network parameters to control the navigation

Parameter      | Value        | Meaning
maxDerTriples  | int          | max. number of triples allowed in each dereferencing
saveGraph      | boolean      | save the graphs dereferenced
maxSize        | int          | traffic limit (in MBs)
timeoutDer     | long         | connection time-out
timeout        | long         | total time-out
domains        | List<String> | trusted servers



To illustrate a possible scenario where the navigation can 
be controlled both from a semantic and a network-based per-
spective, consider the following example.

Example 4.1. (Controlled navigation) Find informa-
tion about Rome, starting from its definition in DBPedia and
including other possible definitions of Rome linked to DBPe-
dia, but only if their descriptions contain fewer than 500 triples
and belong to DBPedia, Freebase or The New York Times.



swget <dbp:Rome> (<owl:sameAs>)* -saveGraph
-domains {dbpedia.org, rdf.freebase.com,
data.nytimes.com} -maxDerTriples 500



The command, besides the NautiLOD expression, contains 
the -domains and -maxDerTriples parameters to control 
the navigation on the basis of the trust toward information 
providers and the number of triples, respectively. 
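
The effect of the -domains and -maxDerTriples options can be approximated by a simple admissibility check. The Python sketch below is hypothetical: the function name, its signature and the exact comparison rules (host must equal a trusted domain or be one of its subdomains; a description is rejected when it exceeds the limit) are assumptions made for illustration and are not taken from the swget sources.

```python
from urllib.parse import urlparse


def is_admissible(uri, n_triples, domains=None, max_der_triples=None):
    """Decide whether the description of `uri` should be kept.

    uri             -- the dereferenced URI
    n_triples       -- number of triples obtained by dereferencing it
    domains         -- trusted servers (None means: no restriction)
    max_der_triples -- maximum triples allowed per dereferencing (None: no limit)
    """
    host = urlparse(uri).hostname or ""
    if domains is not None:
        trusted = any(host == d or host.endswith("." + d) for d in domains)
        if not trusted:
            return False
    # Assumption: the limit is inclusive, i.e. exactly max_der_triples is allowed.
    if max_der_triples is not None and n_triples > max_der_triples:
        return False
    return True


# Parameters of Example 4.1.
TRUSTED = ["dbpedia.org", "rdf.freebase.com", "data.nytimes.com"]
```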



4.2 Evaluating NautiLOD expressions

Given a NautiLOD expression e it is possible to build an 
automaton that can recognize NautiLOD expressions. The 
transitions between states of the automaton implement the
navigation process.

4.2.1 The swget Navigation Algorithm 

The swget controlled navigation algorithm is reported in 
Algorithm 1. Moreover, Table 4 describes the high level
primitives used in the pseudo-code to interact with the au- 
tomaton. 

Algorithm 1: swget pseudo-code

Input : e=NautiLOD expression; seed=URI; par=Params<n,v>
Output: set of URIs and literals conforming to e and par;

 1 a = buildAutomaton(e);
 2 addLookUpPair(seed, a.getInitial());
 3 while (∃ p=<uri,state> to look up and checkNet(par)=OK) do
 4     desc = getDescription(p.uri);
 5     if (a.isFinal(p.state)) then
 6         addToResult(p.uri);
 7     if (not alreadyLookedUp(p)) then
 8         setAlreadyLookedUp(p);
 9         if (t=getTest(p.state) ≠ ∅ and evalT(t,desc)=true) then
10             s = a.nextState(p.state,t);
11             addLookUpPair(p.uri, s);
12         if (act=getAction(p.state) ≠ ∅) then
13             if (evalA(act.test,desc)) then exec(act.cmd);
14             s = a.nextState(p.state,act);
15             addLookUpPair(p.uri, s);
16         out = navigate(p,a,desc);
17         for (each URI pair p'=<uri,state> in out) do
18             addLookUpPair(p');
19         for (each literal pair lit=<literal,state> in out) do
20             if (a.isFinal(lit.state)) then
21                 addToResult(lit.literal);

Function navigate(p,a,desc)

Output: List of <uri,state> and <literal,state>

 1 for (each pred in a.nextP(p.state)) do
 2     nextS = a.nextState(p.state, pred);
 3     query = "SELECT ?x WHERE
              {{ ?x pred p.uri } UNION { p.uri pred ?x }}";
 4     for (each res in evalQ(query, desc)) do
 5         addOutput(res, nextS);
 6 return Output;



Table 4: Primitives for accessing the automaton.

Primitive      | Behaviour
getInitial     | returns the initial state q0
nextP(q)       | returns the set {σ | δ(q, σ) = q1} of tokens (i.e., predicates) enabling a transition from q to q1
getTest(q)     | returns the test to perform in the current automaton state
getAction(q)   | returns the action to perform in the current automaton state
nextState(q,σ) | returns the state that can be reached from q by the token σ
isFinal(q)     | returns TRUE if q is an accepting state



The algorithm takes as input a seed URI, a NautiLOD ex-
pression and a set of network parameters, and returns a set
of URIs and literals that conform to the expression and the net-
work parameters. For each URI involved in the evaluation,
possible tests (line 9) and actions (line 12) are considered. 

The procedure navigate is exploited to extract links (line 
3) from a resource identified by p.uri toward other re- 
sources. According to the Linked Data initial proposal [2] 
[section on browsable graphs] p.uri may appear either as 
the subject or the object of each triple. 
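
The behaviour of navigate can be approximated without a SPARQL engine by scanning the triples of a description directly. The sketch below is an illustration under stated assumptions: the description is a plain list of (subject, predicate, object) tuples, the transition table is a hypothetical dictionary, and the sample triples are made up for the example.

```python
def navigate(uri, state, transitions, desc):
    """Return (node, next_state) pairs reachable from `uri` in one step.

    transitions: dict mapping state -> {predicate: next_state}
    desc: iterable of (subject, predicate, object) triples describing `uri`
    """
    out = []
    for pred, next_s in transitions.get(state, {}).items():
        for s, p, o in desc:
            if p != pred:
                continue
            if o == uri:   # pattern { ?x pred uri }: uri is the object
                out.append((s, next_s))
            if s == uri:   # pattern { uri pred ?x }: uri is the subject
                out.append((o, next_s))
    return out


# Made-up description of dbp:Rome
desc = [
    ("dbp:Rome", "dbpo:country", "dbp:Italy"),
    ("fb:Enrico", "dbpo:hometown", "dbp:Rome"),
    ("dbp:Rome", "rdfs:label", "Rome"),
]
transitions = {"q0": {"dbpo:hometown": "q1"}}
result = navigate("dbp:Rome", "q0", transitions, desc)
```

Here the only triple with the enabling predicate has dbp:Rome in object position, so the link is followed backwards to its subject, which is exactly the case the UNION in the query is meant to cover.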




[Figure 4 panels: (a) Time (secs); (b) #Dereferenced URIs; (c) #Triples retrieved. X axis of each panel: $\sigma_{[1-1]}$, $\sigma_{[1-2]}$, $\sigma_{[1-3]}$, $\sigma_{[1-4]}$, $\sigma_{[1-5]}$, $\sigma_{[noAT]}$.]



Figure 4: Evaluation of swget. Each expression has been executed 4 times. Average results are reported. 



4.3 Experimental Evaluation 

To show the real costs of evaluating the different components
of swget expressions over real-world data, we chose two
complex expressions (shown in Fig. 5) to be evaluated over
the Linked Open Data network. We report the results of
swget in terms of execution time (t), URIs dereferenced (d)
and number of triples retrieved (n). Each expression has
been divided into 5 parts (i.e., $\sigma_i$, $i \in \{1..5\}$). Each has been
executed as a whole (i.e., $\sigma_{[1-5]}$) and as an action-and-test-free
expression (i.e., $\sigma_{[noAT]}$), which correspond to $E$ and $\bar{E}$,
respectively (see Section 3.4). Moreover, the various
sub-expressions (i.e., $\sigma_{[1-i]}$, $i \in \{1..4\}$) have also been executed.
This leads to a total of 12 expressions. For each expression,
the corresponding sub-Web has been locally retrieved; that
is, for each reachable URI the corresponding RDF graph
has been locally stored. The aim of the evaluation is to
investigate how the various components of the cost model
presented in Section 3.4 affect the parameters t, d and n.



seed URI := http://dbp:Stanley_Kubrick

σ1: <dbpo:influenced><3>
σ2: [ASK ?p <dbpo:birthDate> ?y . FILTER(?y > 1961-01-01)]
σ3: {sendEmail(?p) [SELECT ?p WHERE {?x <foaf:name> ?p .}]}
σ4: <dbpo:director>
σ5: <owl:sameAs>?

seed URI := http://dbp:Italy

σ1: <dbpo:homeTown>
σ2: [ASK ?person <rdf:type> <dbpo:Person> .
         ?person <rdf:type> <dbpo:MusicalArtist>]
σ3: <dbpo:birthPlace>
σ4: [ASK ?t <dbpo:populationTotal> ?p . FILTER(?p < 15000)]
σ5: <owl:sameAs>*

Figure 5: Expressions used in the evaluation 

The results of the evaluation, in logarithmic scale, are
reported in Fig. 4(a)-(c). In particular, the X axis reports,
from left to right: the 4 sub-expressions, the full
expression (i.e., $\sigma_{[1-5]}$) and the action-and-test-free
expression (i.e., $\sigma_{[noAT]}$). Note that in some cases the number of
results is higher than the number of dereferenced URIs
reported, because not all the results were dereferenceable URIs.

The first expression ($E_1$) starts by finding people influenced
by Stanley Kubrick up to level 3 (subexpr. $\sigma_{[1-1]}$).
This operation requires about 61 secs, for a total of 221
URIs dereferenced. On the description of each of these 221
URIs, an ASK query is performed to select only those
entities that were born after 1961 (subexpr. $\sigma_{[1-2]}$). The
execution time of the queries is about 4 secs (i.e., about
0.02 secs per query). As a result, 31 entities are selected.
At this point, an action is performed on the descriptions of
these 31 entities by selecting their <foaf:name> to be sent
via email (subexpr. $\sigma_{[1-3]}$). In total, the select, the rendering
of the results in HTML format and the transmission
of the emails cost about 25 secs. The navigation continues
from the 31 entities before the action to get movies through
the property <dbpo:director> (subexpr. $\sigma_{[1-4]}$). The cost
is about 34 secs, for a total of 136 movies. Finally, for
each movie only one level of possible additional descriptions
is searched through the <owl:sameAs> property (the whole expr.
$\sigma_{[1-5]}$), whose cost is 1638 secs, for a total of 409 new URIs
available from multiple servers (e.g., linkedmdb.org,
freebase.org), of which only 289 were dereferenceable.

By referring to the cost model in Section 3.4, we have that
$cost(E_1, W) = cost(\bar{E}_1, W_{T_1}) + cost(A_1) + cost(T_1) = 1763$.
Here, $cost(A_1) \approx 25$ secs, whereas $cost(T_1) \approx 4$
secs and $cost(\bar{E}_1, W_{T_1}) \approx 1738$ secs. If we consider the
test-and-action-free expression executed over the whole Web
of Data (i.e., $W$), we have that $cost(\bar{E}_1, W) \approx 20018$ secs.
Note that the ASK queries cost about 4 secs and reduce
the portion of the Web of Data navigated by $E_1$,
which saves about 20018 − 1738 = 18280 secs.
Such a large difference in the execution times is explained
by the fact that the 222 initial URIs selected by $\sigma_{[1-1]}$ are
not filtered in the case of $(\bar{E}_1, W)$ and therefore cause a larger
number of paths to be followed at the second level. Indeed,
the total number of dereferenced URIs for $(\bar{E}_1, W)$ is 6053,
while for $(\bar{E}_1, W_{T_1})$ it is 646, with about 660K triples retrieved
in the first case and 125K in the second case.

The second expression ($E_2$) starts by navigating the property
<dbpo:homeTown> to find entities living in Italy (subexpr.
$\sigma_{[1-1]}$), with an execution time of about 84 secs and a total
of 400 dereferenced URIs: one seed and 399 URIs of entities.
On the description of each of these 399 URIs, an ASK
query filters entities that are of type <dbpo:Person> and
<dbpo:MusicalArtist> (subexpr. $\sigma_{[1-2]}$). Hence, 399 ASK
queries are performed for a total of about 3.8 secs, with
an average time per query of 0.01 secs, to select 156 entities.
For these entities, the navigation continues through the
property <dbpo:birthPlace> to find the places where these
people were born (subexpr. $\sigma_{[1-3]}$), which costs about
101 − 87 = 14 secs. In total, 43 new URIs have been reached. The
navigation continues with a second ASK query to select only
those places with fewer than 15000 inhabitants (subexpr.
$\sigma_{[1-4]}$). The cost of performing 43 ASK queries on the
results of the previous step is about 3 secs. Here 5 places are
selected. Finally, for each of the 5 places additional descriptions
are searched by navigating the <owl:sameAs> property
(the whole expr. $\sigma_{[1-5]}$). This reaches a total of 29
URIs, some of which are external to dbpedia.org. The cost
of this operation is about 57 secs.

As for the cost, we have $cost(E_2, W) = cost(\bar{E}_2, W_{T_2}) +
cost(A_2) + cost(T_2) = 161$. The factor $cost(A_2) = 0$ since
$E_2$ does not contain any action, whereas $cost(T_2) \approx 6$ secs.
Hence, $cost(\bar{E}_2, W_{T_2}) = 155$ secs. The cost of the
test-and-action-free expression (i.e., $\bar{E}_2$) over $W$ is
$cost(\bar{E}_2, W) \approx 1600$ secs, for a total of 1277 dereferenced URIs.
This is because the expression is not selective, since it performs a
sort of "semantic" crawling based only on RDF predicates.
In fact, the number of triples retrieved (see Fig. 4(c)) is
almost three times higher than in the case of the expression
with tests. By including the tests, the evaluation of $E_2$ is
1445 secs faster.
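
The savings and per-query averages quoted in this section can be rechecked with a few lines of arithmetic. The snippet below only restates figures given in the text; the variable names are ours.

```python
# Timings in seconds, copied from the text of this section.
e1_free_filtered = 1738     # test-and-action-free E1 over the filtered sub-Web
e1_free_unfiltered = 20018  # test-and-action-free E1 over the whole Web of Data
e2_free_filtered = 155      # test-and-action-free E2 over the filtered sub-Web
e2_free_unfiltered = 1600   # test-and-action-free E2 over the whole Web of Data

# Time saved by letting the tests prune the navigation
saving_e1 = e1_free_unfiltered - e1_free_filtered
saving_e2 = e2_free_unfiltered - e2_free_filtered

# Average cost of one ASK query: total test time / number of descriptions tested
avg_ask_e1 = 4 / 221
avg_ask_e2 = 3.8 / 399

print(saving_e1, saving_e2, round(avg_ask_e1, 2), round(avg_ask_e2, 2))
# prints: 18280 1445 0.02 0.01
```

In both expressions the tests cost only a few seconds in total, while the pruning they enable accounts for almost all of the difference in execution time.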

5. A PROPOSAL FOR DISTRIBUTED swget 

This section presents an overview of Distributed swget
(Dswget), whose peculiarity is that the processing of
NautiLOD expressions occurs in a cooperative manner among
LOD information providers. The tool has been implemented
and tested on a local area network.

5.1 Dswget: making LOD servers cooperate 

swget enables controlled navigation, but it relies heavily on
the client that initiates the request. However, one may think
of the Linked Data servers storing RDF triples as peers in
a Peer-to-Peer (P2P) network, where links are given by URIs
in RDF triples. For instance, (dbp:Rome, owl:sameAs, fb:Rome)
links dbpedia.org with freebase.org. There are, however,
some differences w.r.t. a traditional P2P network. First,
Linked Data servers are less volatile than peers. Second,
it is reasonable to assume that the computational power
of Linked Data servers is higher than that of traditional
peers. This makes it possible to handle a higher number of
connections with the associated data.

Our proposal is to leverage the computational power of
servers in the network to cooperatively evaluate swget
commands. This drastically reduces the amount of
data transferred. In fact, data is not transferred from servers
to the client that initiates the request (in response to HTTP
GETs). Instead, servers exchange swget commands plus some
metadata and operate on their data locally. This can be
achieved by installing a Dswget engine on each server in the
network and coordinating the cooperation through an ad-hoc
distributed algorithm.



Table 5: Primitives of Dswget

Primitive      Behaviour
-------------  ------------------------------------------------------------
sendResults    sends to the original client (partial) results, which are
               URIs (line 14) and literals (line 18)
fwdToServers   forwards to other servers the initial client address, the
               NautiLOD expression and a set of pairs <URI, A state>. For
               each pair, the computation on a URI will be started from
               the corresponding A state



A Dswget command is issued by a Dswget client to the
server to which the seed URI belongs. Each server involved
in the computation will receive, handle and forward commands
and results by using the Procedure handle. Note
that this procedure calls some of the primitives reported
in Table 4 and the function navigate described
in Section 4.2.1. The specific primitives needed by Dswget
are reported in Table 5.



Procedure handle(client_id, e, URIs, metadata)

Input: client_id=address of the client; e=NautiLOD
       expression; URIs=set of pairs <URI,A state>;
       metadata=additional data (e.g., current state of the
       automaton, request id)

 1 a=buildAutomaton(e);
 2 for (each p=<uri,state> in URIs) do
 3     desc=getDescription(p.uri); //local call no deref. needed
 4     if (not alreadyLookedUp(p)) then
 5         setAlreadyLookedUp(p);
 6         if (t=getTest(p.state) ≠ ∅ and evalT(t,desc)=true) then
 7             s=a.nextState(p.state,t);
 8             addLookUpPair(p.uri,s);
 9         if (act=getAction(p.state) ≠ ∅) then
10             if (evalA(act.test,desc)) then exeC(act.cmd);
11         out=navigate(p,a,desc);
12         for (each URI pair p'=<uri,state> in out) do
13             if (a.isFinal(p'.state)) then
14                 addToResult(p'.uri);
15             else addLookUpPair(p');
16         for (each literal pair lit=<literal,state> in out) do
17             if (a.isFinal(lit.state)) then
18                 addToResult(lit.literal);
19 sendResults(client_id);
20 fwdToServers(client_id,e);



5.1.1 A Running example 

To see an example of how Dswget works, consider the
following request originated from a Dswget client:

Example 5.1 (Dswget). Starting from DBPedia, find cities
with fewer than 15000 persons, along with their aliases, in
which musicians, currently living in Italy, were born.

The Dswget command is reported in Fig. 6, which also
reports a possible Dswget interaction scenario. On each Linked
Data server a Dswget engine has been installed. Each server
exposes a set of dereferenceable URIs for which the
corresponding RDF descriptions are available. RDF data enables
both internal references (e.g., dbp:Rome and dbp:uri1) and
external ones (e.g., fb:Enrico and geo:Paris).

In Fig. 6, references between URIs are represented by dotted
arrows. When not explicitly mentioned, it is assumed
that the reference occurs on a generic predicate. The
automaton associated to this expression, having q4 and q5 as
accepting states, is also reported. The state(s) of the
automaton on which a server is operating is (are) reported in
grey. Dswget protocol messages have been numbered to
emphasize the order in which they are exchanged.

The command, along with some metadata (e.g., the address
of the client), is issued by the client's Dswget engine
toward the server to which the seed URI belongs (i.e.,
dbpedia.org in this example). The Dswget engine at this server,
after locally building the automaton, starts the processing of
the NautiLOD expression at the state q0. It obtains from
its local RDF store the description of Rome, P(dbp:Rome),
and looks for URIs having dbpo:hometown as a predicate. In
Fig. 6 the URI fb:Enrico satisfies this pattern. The Dswget
engine at dbpedia.org performs the first state transition,
that is, $\delta(q_0, \sigma_1) = q_1$. The automaton does not reach a
final state, so the process has to continue. Since the
URI fb:Enrico belongs to another server, the Dswget engine
at dbpedia.org communicates with that at freebase.org
by sending the initial NautiLOD expression, the URI for
which freebase.org is involved in the computation (i.e.,
fb:Enrico) and the current state of the automaton. When
multiple URIs have to be sent, they are packed
together in a single message.
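
The packing of several URIs into one message per destination can be sketched as follows. This is an illustrative sketch, not the Dswget wire format: the message fields, the client identifier and the sample URIs are hypothetical, and the destination server is simply taken to be the host part of each URI.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_server(pairs):
    """Group <URI, state> pairs by the host that serves each URI."""
    batches = defaultdict(list)
    for uri, state in pairs:
        batches[urlsplit(uri).netloc].append((uri, state))
    return dict(batches)


def build_messages(client_id, expression, pairs):
    """Build one message per destination server, carrying all its pairs."""
    return [
        {"to": host, "client": client_id, "expr": expression, "pairs": batch}
        for host, batch in group_by_server(pairs).items()
    ]


# Hypothetical look-up pairs produced by one engine
pairs = [
    ("http://freebase.org/Enrico", "q1"),
    ("http://freebase.org/Anna", "q1"),
    ("http://geonames.org/Solarolo", "q3"),
]
msgs = build_messages("client-42", "<expr>", pairs)
```

With these three pairs, two messages are produced: one for freebase.org carrying two pairs and one for geonames.org carrying one, and each of them includes the client address so that results can be returned directly.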
Figure 6: Distributed Dswget interaction scenario.



With similar reasoning the request reaches the Dswget
engine at geonames.org, which checks whether it is possible to
reach the next state of the automaton starting from the URI
passed by freebase.org. It has to check on P(geo:Solarolo)
whether the query represented by $\sigma_4$ can be satisfied, that is,
whether this city has fewer than 15K inhabitants. Then, the
state q4 is reached, which is a final state. The Dswget
engine at geonames.org directly contacts the Dswget engine of
the client that issued the request and sends the result (i.e.,
the URI geo:Solarolo). The address of the client is passed
along in each communication among Dswget engines.

Note that the automaton has another final state, that
is, q5, which can be reached if there exist some triples in
P(geo:Solarolo) having an owl:sameAs predicate. Such
a triple is (geo:Solarolo, owl:sameAs, yago:Solarolo).
Therefore, the Dswget engine at geonames.org sends to the
engine at yago.org the URI in the object of this triple, the
expression and the current state of the automaton. Here, as
the automaton is in a final state, the Dswget engine sends
the result to the client and continues the process. In this case,
since in P(yago:Solarolo) there are no more triples having
owl:sameAs as predicate, the process ends.

5.2 Dswget Design issues: an overview 

In designing Dswget, several issues typical of distributed
systems had to be faced. Here we briefly report on the main
ones without going into too much technical detail.

In the Web of Data, a client that wants information
about a resource issues an HTTP GET request toward the
HTTP server where the resource is hosted. In the standard
case, the HTTP protocol offers a blocking semantics
for its primitives, which means that once a request is
issued the client has to wait for an answer or for a timeout.
In Dswget, since engines exchange messages and data
in a P2P fashion, a blocking semantics for communications
would block the whole execution. To face this issue, specific
asynchronous communication primitives and a job delegation
mechanism have been implemented. By job delegation we
mean that the sending Dswget engine delegates part of the
execution and evaluation of a (sub)NautiLOD expression
to the receiving engine(s). Since a request is spread among
multiple Dswget engines through job delegation, it is
necessary to handle the termination of requests to avoid
consuming resources in an uncontrolled way. Dswget tackles
this issue from two different perspectives:

(1) Loop detection: each Dswget engine keeps track, for
each request, of each URI along with the state of the
automaton on which it has been processed.

(2) Termination: each Dswget engine, for each request it
receives, informs the client that initially issued the request
that it has operated on this request and whether it has
delegated other Dswget engines. The client can then keep track
of the list of engines active on a particular request. The Dswget
engine may additionally send back to the client the state of
the automaton on which it is operating, thus enabling the
client to know how far the execution is from a final state.
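
The two mechanisms above can be illustrated with a small sketch. It rests on assumptions that go beyond the text: each delegation message is counted as one job, each engine sends exactly one report per job it receives, and all class and method names are hypothetical rather than part of Dswget.

```python
from collections import Counter


class EngineState:
    """Engine-side loop detection: <URI, state> pairs already processed."""

    def __init__(self):
        self.seen = {}  # request_id -> set of (uri, state)

    def should_process(self, request_id, uri, state):
        seen = self.seen.setdefault(request_id, set())
        if (uri, state) in seen:
            return False  # same URI reached again in the same automaton state
        seen.add((uri, state))
        return True


class ClientTracker:
    """Client-side termination check for a single request."""

    def __init__(self, seed_server):
        self.expected = Counter({seed_server: 1})  # jobs handed to each engine
        self.reported = Counter()                  # jobs each engine reported

    def on_report(self, engine, delegated_to):
        # The engine has operated on the request and names its delegates
        self.reported[engine] += 1
        for other in delegated_to:
            self.expected[other] += 1

    def terminated(self):
        # Done when every job handed out has been reported back
        return self.expected == self.reported


tracker = ClientTracker("dbpedia.org")
tracker.on_report("dbpedia.org", ["freebase.org"])
tracker.on_report("freebase.org", ["geonames.org"])
still_running = not tracker.terminated()
tracker.on_report("geonames.org", [])
```

Counting jobs per engine, rather than keeping a plain set of active engines, keeps the check correct when a report from a delegate reaches the client before the report of the engine that delegated it.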

6. RELATED WORK 

Many of the ideas underlying our proposal have been around 
in particular settings. We owe inspiration to several of them. 

Navigation and specification languages for nodes in a graph
have a deep research background. Nevertheless, most of these
developments assume that data is stored in a central
repository (e.g., Web query languages [9], XPath, navigational
versions of SPARQL [22] [1]). They were the inspiration for the
navigational core of NautiLOD.

Specification (and retrieval) of collections of sites was
addressed early; a good example is the well-known tool wget.
Besides being non-declarative, it is restricted to almost purely
syntactic features. At the semantic level, Harth et al. [18]
proposed LDSpider, a crawler for the Web of Data able to
retrieve RDF data by following RDF links according to
different crawling strategies. Such crawlers have little flexibility
and are not declarative. The execution philosophy of wget was
a source of inspiration for the incorporation of actions into
NautiLOD and for the design of swget.

Distributed data management has been explored and
implemented by P2P and similar approaches [26]. For RDF,
RDFPeers [5] and YARS2 use P2P to answer RDF queries.
Systems for distributed query processing on the Web have
also been devised, e.g., DIASPORA [24]. Our distributed
version of swget borrows some ideas from these approaches.

Finally, it is important to stress that there is a
solid body of work on query processing and navigation on
the Web of Data. Three lines of research can be identified:

(1) Load the desired data into a single RDF store (by
crawling the LOD or some sub-portions) and process queries
in a centralized way. There is a large list of triple stores [17].
There have also been developments in indexing techniques
for semantic data; Swoogle [s], Sindice ^ and Watson
[g] are among the most successful. Recently, Harth et al. [13]
proposed an approximate index structure for summarizing
the content of Linked Data sources.

(2) Process the queries in a distributed form by using a
federated query processor. DARQ [23] and FedX [25] provide
mechanisms for transparent query answering on multiple
query services. The query is split into sub-queries that
are forwarded to the individual data sources, and their
results are processed together. An evaluation of federated query
approaches can be found in p^.

(3) Extend SPARQL with navigational features. The
SERVICE feature of SPARQL 1.1 and proposals like the one
of Hartig et al. [15] extend the scope of SPARQL queries with
navigational features [15][14]. The system SQUIN, based on
link traversal, a query execution paradigm that discovers on
the fly data sources relevant for the query, automatically
navigates to other sources while executing a query.

As can be seen, our approach has a different point of
departure: it focuses on navigational functionalities, thus
departing from querying as in (2); emphasizes specification
of autonomous distributed sources, as opposed to (1); uses
SPARQL querying to enhance navigation, while (3) proceeds
in the reverse direction; and incorporates actions that
in some sense generalize procedures implicit in the evaluation
over the Web (e.g., "get data" in crawlers and "return
data" in query languages).

7. CONCLUSIONS AND FUTURE WORK 

We presented a language to navigate, specify fragments
and perform actions on the Web of Data. It explicitly
exploits the semantics of the data "stored" at each URI. We
implemented it in a centralized setting to run over real-world
data, namely the LOD network, showing the benefits it can
bring. We also developed a distributed version as a
proof-of-concept of its feasibility, potentialities and challenges.

The most important conclusion we can draw from this re- 
search and development is that the semantics given by RDF 
specifications can be used with profit to navigate, specify 
places and actions on the Web of Data. We presented a lan- 
guage that can be used as the basis for the development of 
agents that get data; navigate and report while navigating; 
and that can work immediately over LOD. 

A second relevant finding we would like to report concerns
the limitations that prevent taking full advantage of the
language and tools we developed. They refer essentially to (1)
lack of standards in the sites regarding the dereferencing of
data; (2) lack of standard RDF metadata regarding properties
of the sites themselves (e.g., provenance, summary of
contents, etc.); (3) weak infrastructure to host delegation of
execution and evaluation (of the language) to permit
distribution. Tackling these issues can be considered our wish
list for leveraging the Web of Data.